1. INTRODUCTION

Speech recognition is a broad research field with diverse applications, such as assisting people with
speech impairments and improving the accuracy of voice command fingerprinting attacks, as discussed in
[1]-[3]. One of the most representative lines of research is emotion recognition using spectrograms,
mel frequency cepstral coefficients (MFCCs), and convolutional networks [4], [5]. Works such as [6]
employ convolutional networks with three-dimensional inputs based on the first and second derivatives
of the spectrogram.

Voice command interaction with robots is another area of interest in speech recognition [7]-[9].
Voice assistants such as Amazon's Alexa [10] or Google's [11] enable a more natural method of
human-robot interaction. This highlights that voice commands are a necessity in interaction with robots
[12], an application in which convolutional networks also provide high performance [13]-[15].

Nowadays, the development of intelligent environments, including smart homes, is gaining strength
[16]. Robotic technology has been incorporated into these environments for tasks such as caring for
people [17], cleaning [18], and even cooking [19]. In [20] and [21], assistive robots are developed to
address patient isolation due to COVID; however, these systems are remotely operated from mobile
phones.

Given the relevance of this topic and the need for a more natural and autonomous remote command
system, this work presents an audio command recognition system for an assistive robot in a residential
environment. It integrates the approaches found in the state of the art into a voice assistant for
robotic action, preprocessing the audio with cepstral coefficients and then recognizing it with
convolutional networks. As a contribution to the state of the art, a neuro-convolutional architecture is
designed to be easily embedded in portable electronic systems, and the command words are separated by
means of a sliding window that calculates the power density of the audio signal.

Journal homepage: http://ijai.iaescore.com
ISSN: 2252-8938

The document is divided into four sections. The present section reviews the state of the art related
to the work developed. Section 2 presents the methodology used to separate the words that make up each
command and to train the neural network. Section 3 presents the analysis of the results achieved and,
finally, section 4 presents the conclusions reached.


2. METHOD 

The objective is to use voice commands consisting of groups of three words to trigger assistive
actions of a mobile robot within a residential environment. The sequence of control words is recorded
and each word is separated to obtain a two-dimensional map of its audio signal by means of mel frequency
cepstral coefficients (MFCCs). Each map is classified using a convolutional neural network, and the
coherence of the command is validated before the action is executed. The general scheme is presented in
Figure 1.




Figure 1. Flowchart of the experimental methods applied 


The model is trained on a database consisting of 3600 recordings from different users, distributed in
8 classes corresponding to robot, bring, carry, stop, paper, cup, towel, and medicine, where 80% are
taken for training and 20% for validation. Each audio is acquired with a sampling frequency of 16000 Hz,
and each word is separated at the minima found in the absolute value of the original signal, as shown in
Figure 2. The power density of the audio signal is calculated over a sliding window of ten times the
input frequency; each value is compared with the previous one and, if it shows a decay of 75%, it is
established as a local minimum, a point used for the separation of the words. The asterisks on the right
side of Figure 2 mark the minimum values found.
Figure 2. Detection of minima for word separation
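
The separation rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the use of mean squared amplitude as the power estimate, and the default non-overlapping hop are hypothetical, and the window length is left as a parameter.

```python
def window_power(signal, window, hop=None):
    """Mean power (mean of squared samples) of successive windows.

    Assumption: hop defaults to the window length (non-overlapping windows).
    """
    hop = hop or window
    powers = []
    for start in range(0, len(signal) - window + 1, hop):
        chunk = signal[start:start + window]
        powers.append(sum(x * x for x in chunk) / window)
    return powers


def find_split_points(signal, window, hop=None, decay=0.75):
    """Return sample indices where the window power drops by `decay` or more
    with respect to the previous window (candidate word boundaries)."""
    hop = hop or window
    powers = window_power(signal, window, hop)
    splits = []
    for i in range(1, len(powers)):
        prev, curr = powers[i - 1], powers[i]
        # A 75% decay means the current power is at most 25% of the previous one.
        if prev > 0 and curr <= (1.0 - decay) * prev:
            splits.append(i * hop)
    return splits


# Toy signal: two loud bursts, each followed by a quiet gap.
toy = [1.0] * 4 + [0.01] * 4 + [1.0] * 4 + [0.01] * 4
boundaries = find_split_points(toy, window=4)
```

On the toy signal, a boundary is reported at the start of each quiet gap. A real recording would need a window long enough to smooth over individual pitch periods.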


The database was recorded with 10 male and 10 female users to diversify the learning. Figure 3 shows
examples from the database: Figure 3(a) shows a woman's voice and Figure 3(b) a man's voice, which
differ in amplitude and frequency spectrum. The original database was recorded in Spanish.

Int J Artif Intell, Vol. 12, No. 2, June 2023: 585-592
Figure 3. Command robot carry medicine, (a) female voice and (b) male voice


Each of the three words is preprocessed for feature extraction to obtain a two-dimensional,
three-channel map, which allows a convolutional neural network [22] to learn the behavior of the voice
command over time so that it can be recognized. The feature map is obtained by calculating the mel
frequency cepstral coefficients (MFCCs) using (1) to (3). These coefficients represent speech based on
human auditory perception [23] and are widely used in speech analysis systems [24].
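
For orientation, commonly used textbook forms of the cepstral coefficients and of their regression-based first and second derivatives are sketched below. The symbols are assumptions introduced here ($S_t[m]$ is the energy of mel filter $m$ in frame $t$, $M$ the number of filters, $N$ the regression half-width), and the exact formulation used in (1) to (3) may differ.

```latex
c_t[n] = \sum_{m=1}^{M} \log\big(S_t[m]\big)\,
         \cos\!\left[\frac{\pi n}{M}\left(m - \tfrac{1}{2}\right)\right],
         \qquad n = 1, \dots, 12

\Delta c_t[n] = \frac{\sum_{k=1}^{N} k\,\big(c_{t+k}[n] - c_{t-k}[n]\big)}
                     {2\sum_{k=1}^{N} k^{2}}

\Delta^{2} c_t[n] = \frac{\sum_{k=1}^{N} k\,\big(\Delta c_{t+k}[n] - \Delta c_{t-k}[n]\big)}
                         {2\sum_{k=1}^{N} k^{2}}
```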
By means of (1) it is possible to generate a feature map of 12 coefficients acquired from 199 frames.
This is the first input channel to the network; the first and second derivatives, (2) and (3)
respectively, generate the other two channels. Therefore, the learning input to the network has
dimensions 12x199x3.
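
The construction of the three-channel input can be illustrated with a short sketch. It assumes a regression-based derivative with a half-width of 2 frames and edge replication, and it stores the result channel-first (3 x 12 x 199) rather than as 12x199x3; none of these choices is stated in the text. The framing in `frame_count` (20 ms frames with a 10 ms hop) is likewise only one hypothetical setting that yields 199 frames for a 2-second signal at 16000 Hz.

```python
def delta(features, half_width=2):
    """Regression-based derivative along the frame axis.

    `features` is a list of coefficient rows, each holding one value per frame.
    Frames beyond the edges are replaced by the nearest valid frame.
    """
    denom = 2 * sum(k * k for k in range(1, half_width + 1))
    out = []
    for row in features:
        last = len(row) - 1
        d_row = []
        for t in range(len(row)):
            acc = 0.0
            for k in range(1, half_width + 1):
                acc += k * (row[min(t + k, last)] - row[max(t - k, 0)])
            d_row.append(acc / denom)
        out.append(d_row)
    return out


def stack_channels(mfcc):
    """Return [static, first derivative, second derivative] as three channels."""
    d1 = delta(mfcc)
    d2 = delta(d1)
    return [mfcc, d1, d2]


def frame_count(num_samples, frame_length, hop):
    """Number of full frames that fit in the signal without padding."""
    return 1 + (num_samples - frame_length) // hop


# Placeholder MFCC map (a linear ramp), only to exercise the shapes.
mfcc = [[float(t) for t in range(199)] for _ in range(12)]
channels = stack_channels(mfcc)
frames = frame_count(num_samples=32000, frame_length=320, hop=160)
```

Here `channels` holds 3 maps of 12 coefficients by 199 frames, and `frames` evaluates to 199 for the assumed framing.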

The network architecture used is shown in Table 1. It employs six convolution layers, given the
limited number of desired outputs and the specific task to be performed by the network. The training
hyperparameters were found iteratively, using a learning rate of 1e-6 with 50 epochs. Figure 4
illustrates the network learning process, with a training time of 31 minutes for 79250 iterations on a
2.30 GHz Intel Core i7 computer with an NVIDIA GeForce RTX 3070 8 GB GPU, reaching a final accuracy of
96.9%.


Table 1. CNN architecture used
Input: 12x199x3

Layer           Kernel  Filters  Padding  Stride
Convolution     7       32       2        1
BatchNorm       -       -        -        -
Convolution     5       32       2        1
MaxPooling      2       -        0        2
Convolution     3       64       1        1
Convolution     3       64       1        1
MaxPooling      2       -        0        2
Convolution     2       128      1        1
Convolution     3       128      1        1
MaxPooling      2       -        0        2
Fully-Conn      -       512      -        -
Dropout         -       -        -        -
Fully-Conn      -       2048     -        -
Dropout         -       -        -        -
Fully-Conn      -       8        -        -
Softmax         -       -        -        -
Classification  -       -        -        -
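
As a worked check of the architecture in Table 1, the feature-map size after each convolution and pooling layer can be traced with the standard output-size formula. The sketch assumes square kernels, symmetric padding, and floor rounding, which the table does not state explicitly; the names `out_size`, `LAYERS`, and `trace_shapes` are hypothetical.

```python
def out_size(n, kernel, padding, stride):
    """Standard output length of a convolution or pooling layer along one axis."""
    return (n + 2 * padding - kernel) // stride + 1


# (type, kernel, filters, padding, stride) for the size-changing layers of Table 1
LAYERS = [
    ("conv", 7, 32, 2, 1),
    ("conv", 5, 32, 2, 1),
    ("pool", 2, None, 0, 2),
    ("conv", 3, 64, 1, 1),
    ("conv", 3, 64, 1, 1),
    ("pool", 2, None, 0, 2),
    ("conv", 2, 128, 1, 1),
    ("conv", 3, 128, 1, 1),
    ("pool", 2, None, 0, 2),
]


def trace_shapes(height=12, width=199, channels=3):
    """Return the (height, width, channels) shape after every listed layer."""
    shapes = [(height, width, channels)]
    for _, kernel, filters, padding, stride in LAYERS:
        height = out_size(height, kernel, padding, stride)
        width = out_size(width, kernel, padding, stride)
        if filters is not None:  # pooling keeps the channel count
            channels = filters
        shapes.append((height, width, channels))
    return shapes


shapes = trace_shapes()
h, w, c = shapes[-1]
flattened = h * w * c  # number of inputs to the first fully connected layer
```

Under these assumptions the last pooling layer produces a 1x25x128 map, that is, 3200 values feeding the first fully connected layer.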
Figure 4. Network accuracy
Figure 5 illustrates the resulting confusion matrix, which shows that the words are well discriminated
from one another. Only the category "stop" shows a significant percentage of confusion with the category
"towel". The average time for the network to classify each word is 0.6 seconds; since the three words of
a command are submitted to the network one after another, the average response time is 1.8 seconds.
Figure 5. Final confusion matrix


3. RESULTS AND DISCUSSION 
The algorithm is validated by evaluating the action commands to be executed by the mobile robot.
For this purpose, the variants of the commands are established according to the words to be recognized,
as shown in Table 2, and the number of true positives versus false positives that the algorithm exhibits
is determined. A true positive corresponds to a valid command that matches the desired action; a false
positive corresponds to a valid command that does not match the desired action.
Table 2. Valid commands

Command               True positives  False positives
Robot bring glass     16              1
Robot bring paper     18              2
Robot bring towel     16              1
Robot bring medicine  20              5
Robot carry glass     17              0
Robot carry paper     18              3
Robot carry towel     17              0
Robot carry medicine  20              6


The algorithm filters the validity of a command in software, initially checking for the existence of
the three words, in this case by means of the minima of the signal spectrum. Figure 6 illustrates two
example commands, "robot bring glass" in Figure 6(a) and "robot carry paper" in Figure 6(b), with the
location of the minima that divide the words, where the difference between the commands can be
appreciated. The first word must always be robot; otherwise the command is not validated. From Table 2,
it is possible to derive an efficiency of 88.75% in the discrimination of the commands to the robot.
Regarding class confusion, 16.6% of the false positives correspond to confusion between carry and bring,
and 55.5% correspond to confusing the object (glass, paper and towel) with the class medicine.
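
The reported efficiency can be reproduced from Table 2 with a short calculation. The sketch assumes 20 trials per command, a value that is not stated explicitly in the text but is consistent with the maximum of 20 true positives and with the 88.75% figure; the names used are hypothetical.

```python
# (command, true positives, false positives) as listed in Table 2
TABLE_2 = [
    ("robot bring glass", 16, 1),
    ("robot bring paper", 18, 2),
    ("robot bring towel", 16, 1),
    ("robot bring medicine", 20, 5),
    ("robot carry glass", 17, 0),
    ("robot carry paper", 18, 3),
    ("robot carry towel", 17, 0),
    ("robot carry medicine", 20, 6),
]

TRIALS_PER_COMMAND = 20  # assumption: not given explicitly in the text


def efficiency(rows, trials_per_command):
    """Fraction of trials recognized as the intended command."""
    total_tp = sum(tp for _, tp, _ in rows)
    return total_tp / (trials_per_command * len(rows))


overall = efficiency(TABLE_2, TRIALS_PER_COMMAND)   # 142 / 160
total_fp = sum(fp for _, _, fp in TABLE_2)          # 18 false positives in total
```

With 142 true positives over 160 assumed trials, `overall` evaluates to 0.8875. The table lists 18 false positives in total, so the reported 16.6% and 55.5% would be consistent with 3 and 10 of them, respectively.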
Figure 6. Example commands, (a) robot bring glass and (b) robot carry paper


Figure 7 illustrates the network predictions obtained by separating each word and evaluating it. To
the right of each separated word there is noise, generated to complete the size of the information
vector. This is because the duration of the input to the network is 2 seconds, which at a sampling
frequency of 16000 Hz implies 32000 samples. When each word is trimmed, the vector is shortened and,
since it cannot be filled with a uniform value because of the MFCC derivatives, a random filling of
±0.01 is generated. Figure 7(a) and Figure 7(b) illustrate two examples of different commands, from the
original signal to its separation into words and the recognition of each one (robot bring paper and
robot carry paper, respectively).
Figure 7. Commands correctly recognized, (a) robot bring paper and (b) robot carry paper
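
The padding step described above can be sketched as follows. The target length of 32000 samples comes from the text; the uniform distribution of the filler, the 0.01 amplitude bound, the truncation of longer inputs, and the function name `pad_word` are assumptions made for illustration.

```python
import random

TARGET_SAMPLES = 2 * 16000  # 2 s at 16000 Hz, as stated in the text


def pad_word(samples, target=TARGET_SAMPLES, amplitude=0.01, seed=None):
    """Pad a trimmed word on the right with low-level random noise.

    Assumptions: filler values are drawn uniformly from [-amplitude, amplitude],
    and inputs longer than `target` are truncated.
    """
    rng = random.Random(seed)
    if len(samples) >= target:
        return list(samples[:target])
    filler = [rng.uniform(-amplitude, amplitude)
              for _ in range(target - len(samples))]
    return list(samples) + filler


word = [0.5] * 1000          # placeholder for a trimmed word
padded = pad_word(word, seed=0)
```

The original samples stay at the start of the vector and the noise fills the remainder, which matches the noise visible to the right of each separated word.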


In contrast to Figure 7, Figure 8 illustrates a case of erroneous detection of the action command
"robot carry paper". The first and second words appear similar, varying mainly in amplitude, so they are
recognized as the same word and the network outputs "robot robot paper", which is classified as an
invalid command. In this case, the error was associated with environmental noise at the time of
recording the command.
Figure 8. Error in detecting robot carry paper
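
The command validation discussed in this section can be illustrated with a minimal sketch. The grammar "robot, action, object" and the vocabularies below are inferred from Table 2 and are assumptions; the class "stop" is left out because the text does not describe the form of its command, and the function name is hypothetical.

```python
ACTIONS = {"bring", "carry"}
OBJECTS = {"glass", "paper", "towel", "medicine"}


def is_valid_command(words):
    """Accept only three-word commands of the form: robot <action> <object>."""
    if len(words) != 3:
        return False
    first, action, obj = words
    return first == "robot" and action in ACTIONS and obj in OBJECTS


ok = is_valid_command(["robot", "carry", "paper"])
rejected = is_valid_command(["robot", "robot", "paper"])  # the case of Figure 8
```

The misrecognized sequence "robot robot paper" is rejected because its second word is not an action, so no robot action would be triggered.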


Similar work is presented in [25], where the robotic action commands also employ word separation and
feature extraction by MFCC, using a single channel but combining the CNN with an LSTM network; the
authors report up to 90.37% accuracy for word recognition. The 6% improvement achieved by the CNN
developed in this work is attributed to a higher number of training audios and the use of the MFCC
derivatives of each word. It was also verified that using 7x7 filters in the first convolutional layer,
instead of 5x5 as in [25], helped to improve the accuracy by 3%.

A virtual environment was designed to evaluate the robotic navigation task and voice command
discrimination, as shown in Figure 9. The response time of the robot in discriminating the actions is
about 8 seconds. This time includes the robot's response to a valid command and the identification of
the place where the desired action will be carried out in the residential environment.
Figure 9. Virtual test environment


4. CONCLUSION 

The use of convolutional networks for voice command recognition in mobile robotics offers a natural
means of human-machine interaction. It is concluded that MFCC feature extraction generates a map of
features recognizable by the network, resulting in a functional voice assistant for robotic command. It
was found that speech recognition accuracy depends on low ambient noise and on adequate vocalization of
each word. The influence of these factors decreases when the database is enlarged with background noise
and with variations in the speed and volume of pronunciation. It was also concluded from the training of
the network that the use of few classes facilitates the discrimination of the spoken word, suggesting
that future training can use identification trees, for example, one network for identifying actions
(verbs) and another for objects (nouns), to expand the number of commands received by the robot.